Abstract

Many modern multiclass and multilabel problems are characterized by increasingly
large output spaces. For these problems, label embeddings have been shown
to be a useful primitive that can improve computational and statistical efficiency.
In this work we utilize a correspondence between rank constrained estimation and
low dimensional label embeddings that uncovers a fast label embedding algorithm
which works in both the multiclass and multilabel settings. The result is a randomized
algorithm for partial least squares, whose running time is exponentially faster
than naive algorithms. We demonstrate our techniques on two large-scale public
datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory
Project, where we obtain state of the art results.


1 Contributions 

We provide a statistical motivation for label embedding by demonstrating that the optimal
rank-constrained least squares estimator can be constructed from an optimal unconstrained estimator of
an embedding of the labels. In other words, embedding can provide beneficial sample complexity
reduction even if computational constraints are not binding.
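
The correspondence can be checked numerically on toy data. The following is a minimal NumPy sketch, not the authors' code: the sizes, the random data, and all variable names are assumptions made for illustration. It fits an unconstrained regression onto a k-dimensional embedding of the labels, decodes it, and compares the resulting predictions with the best rank-k approximation of the unconstrained least squares predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, k = 200, 10, 30, 3          # toy sizes, chosen only for illustration
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))

# Unconstrained least squares predictions, i.e. Y projected onto the column space of X.
W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ W_full

# Label embedding: top-k right singular vectors of the fitted values.
U, s, Vt = np.linalg.svd(fitted, full_matrices=False)
V_k = Vt[:k].T                        # shape (c, k)

# Unconstrained regression onto the embedded labels, followed by decoding.
W_embed, *_ = np.linalg.lstsq(X, Y @ V_k, rcond=None)
W_low_rank = W_embed @ V_k.T          # shape (d, c), rank at most k

# Reference: the best rank-k approximation of the fitted values.
best_rank_k = (U[:, :k] * s[:k]) @ Vt[:k]
print(np.allclose(X @ W_low_rank, best_rank_k))
```

In this sketch the only model ever fitted has k outputs rather than c, which is the sense in which the embedding reduces the estimation problem.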

We identify a natural object to define label similarity: the expected outer product of the conditional
label probabilities. In particular, in conjunction with a low-rank constraint, this indicates two label
embeddings are similar when their conditional probabilities are linearly dependent across the dataset.
This unifies prior work utilizing the confusion matrix for multiclass (Bengio et al., 2010) and the
empirical label covariance for multilabel (Tai & Lin, 2012).
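
As a toy illustration of this similarity notion, the sketch below builds synthetic conditional probabilities in which one label's probability is always twice another's, forms the empirical expected outer product, and reads label embeddings off its eigendecomposition. The data, the proportionality factor, and the names are hypothetical; the point is only that linearly dependent conditional probabilities give collinear embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 500, 6                               # toy sizes (assumed)
raw = rng.random((n, c))
raw[:, 1] = 2.0 * raw[:, 0]                 # label 1 is always twice as likely as label 0
P = raw / raw.sum(axis=1, keepdims=True)    # rows are conditional label probabilities

# Empirical expected outer product of the conditional probabilities.
M = P.T @ P / n

# Factor M = E E^T, dropping the (numerically) zero eigenvalue.
evals, evecs = np.linalg.eigh(M)
keep = evals > 1e-10 * evals.max()
E = evecs[:, keep] * np.sqrt(evals[keep])   # row i is the embedding of label i

cos01 = E[0] @ E[1] / (np.linalg.norm(E[0]) * np.linalg.norm(E[1]))
cos02 = E[0] @ E[2] / (np.linalg.norm(E[0]) * np.linalg.norm(E[2]))
print(cos01, cos02)
```

Labels 0 and 1 end up with parallel embeddings, while an unrelated label such as label 2 does not.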

We apply techniques from randomized linear algebra (Halko et al., 2011) to develop an efficient and
scalable algorithm for constructing the embeddings, essentially via a novel randomized algorithm for
partial least squares (Geladi & Kowalski, 1986). Intuitively, this technique implicitly decomposes
the prediction matrix of a model which would be prohibitively expensive to form explicitly.

2 Description 

2.1 Notation 

We denote vectors by lowercase letters x, y etc. and matrices by uppercase letters W, Z etc. The 
input dimension is denoted by d, the output dimension by c and the embedding dimension by k. For 
an $m \times n$ matrix $X \in \mathbb{R}^{m \times n}$ we use $\|X\|_F$ for its Frobenius norm, $X^\dagger$ for the pseudoinverse, and
$\Pi_{X,L}$ for the projection onto the left singular subspace of $X$.
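
For readers who prefer code to notation, the three matrix quantities named above (Frobenius norm, pseudoinverse, projection onto the left singular subspace) can be computed as follows. This is a small NumPy sketch on a random matrix; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # a small m x n example matrix

frobenius = np.linalg.norm(X, "fro")     # Frobenius norm of X
X_pinv = np.linalg.pinv(X)               # pseudoinverse, shape (3, 8)
proj = X @ X_pinv                        # projection onto the left singular subspace

# The same projection written with the left singular vectors of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(proj, U @ U.T), np.allclose(proj @ proj, proj))
```

The second check is the idempotence property that the algorithm description relies on.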

2.2 Proposed Algorithm 

Our proposal is Rembrandt, described in Algorithm 1. We use the top right singular space of $\Pi_{X,L} Y$
as a label embedding, or equivalently, the top principal components of $Y^\top \Pi_{X,L} Y$ (leveraging the
fact that the projection is idempotent). Using randomized techniques, we can decompose this matrix
without explicitly forming it, because we can compute the product of $Y^\top \Pi_{X,L} Y$ with another matrix
$Q$ via $Y^\top \Pi_{X,L} Y Q = Y^\top X Z^*$, where $Z^* = \arg\min_{Z \in \mathbb{R}^{d \times (k+p)}} \|YQ - XZ\|_F^2$. Algorithm 1
is a specialization of randomized PCA to this particular form of the matrix multiplication operator.
Fortunately, because $(\Pi_{X,L} Y)_k$ is the optimal rank-constrained least squares weight matrix,
Rembrandt is a randomized algorithm for partial least squares (Geladi & Kowalski, 1986).

Algorithm 1 Rembrandt: Response EMBedding via RANDomized Techniques

function Rembrandt($k$, $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^{n \times c}$)
    $(p, q) \leftarrow (20, 1)$    ▷ These hyperparameters rarely need adjustment.
    $Q \leftarrow \mathrm{randn}(c, k + p)$    ▷ Randomized range finder for $Y^\top \Pi_{X,L} Y$
    for $j \in \{1, \ldots, q\}$ do
        $Z \leftarrow \arg\min_Z \|YQ - XZ\|_F^2$
        $Q \leftarrow \mathrm{orthogonalize}(Y^\top X Z)$
    end for    ▷ NB: total of $(q + 1)$ data passes, including next line
    $F \leftarrow (Y^\top X Q)^\top (Y^\top X Q)$    ▷ $F \in \mathbb{R}^{(k+p) \times (k+p)}$ is "small"
    $(\hat{V}, \Sigma^2) \leftarrow \mathrm{eig}(F, k)$
    $V \leftarrow Q \hat{V}$    ▷ $V \in \mathbb{R}^{c \times k}$ is the embedding
    return $(V, \Sigma)$
end function
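
The procedure can be sketched in a few lines of dense NumPy. This is an illustrative reading rather than the authors' Matlab implementation: it uses an exact least squares solve for the fit, dense matrices instead of sparse ones, and it assumes that the final data pass applies the same implicit product (fit $YQ$, then multiply the predictions by $Y^\top$) before forming the small matrix $F$. The function name `rembrandt`, the `seed` argument, and the toy data are assumptions.

```python
import numpy as np

def rembrandt(X, Y, k, p=20, q=1, seed=0):
    """Sketch of randomized label embedding; returns (V, sigma)."""
    rng = np.random.default_rng(seed)
    c = Y.shape[1]
    Q = rng.standard_normal((c, k + p))

    def apply_operator(Q):
        # Implicit product Y^T Pi_{X,L} Y Q = Y^T X Z*, with Z* the least squares fit of YQ.
        Z, *_ = np.linalg.lstsq(X, Y @ Q, rcond=None)
        return Y.T @ (X @ Z)

    # Randomized range finder: q rounds of apply-then-orthogonalize.
    for _ in range(q):
        Q = np.linalg.qr(apply_operator(Q))[0]

    # Assumed reading of the final pass: one more implicit product, then a small Gram matrix.
    B = apply_operator(Q)
    F = B.T @ B                                  # (k+p) x (k+p)

    evals, evecs = np.linalg.eigh(F)             # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]
    V = Q @ evecs[:, top]                        # c x k label embedding
    sigma = np.sqrt(np.maximum(evals[top], 0.0))
    return V, sigma

# Toy data with a planted rank-k relationship between features and labels.
rng = np.random.default_rng(1)
n, d, c, k = 500, 40, 60, 5
X = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, k)) @ rng.standard_normal((k, c))
Y = X @ W_true + 0.1 * rng.standard_normal((n, c))
V, sigma = rembrandt(X, Y, k)
print(V.shape, sigma.shape)
```

Under this reading, `sigma` approximates the leading eigenvalues of $Y^\top \Pi_{X,L} Y$. A practical implementation would replace the exact solve with whatever regression routine scales to the data and would keep $Y$ sparse.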

Algorithm 1 is inexpensive to compute. The product $YQ$ is a sparse matrix product,
so its complexity $O(nsk)$ depends only on the average (label) sparsity per example $s$ and
the embedding dimension $k$, and is independent of the number of classes $c$. The fit is done in the
embedding space and is therefore independent of the number of classes $c$, and the outer product with
the predicted embedding is again a sparse product with complexity $O(nsk)$. The orthogonalization
step is $O(ck^2)$, but this is amortized over the data set and essentially irrelevant as long as $n > c$.
Furthermore, random projection theory suggests $k$ should grow only logarithmically with $c$.
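
To make the sparsity argument concrete, consider the multiclass case, where each example has one label ($s = 1$). The sketch below assumes labels are stored as an integer index vector, which is an assumption about data layout rather than something the paper specifies. Both sparse products then reduce to a row gather and a scatter-add whose work is proportional to $n(k + p)$, and the dense one-hot matrix is built only to check the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, m = 1000, 50, 8                      # m stands for k + p (toy sizes)
labels = rng.integers(0, c, size=n)        # one class index per example
Q = rng.standard_normal((c, m))
P = rng.standard_normal((n, m))            # stand-in for the predicted embedding XZ

# YQ for one-hot Y: gather one row of Q per example.
YQ_sparse = Q[labels]

# Y^T P for one-hot Y: scatter-add each example's row into its class.
YtP_sparse = np.zeros((c, m))
np.add.at(YtP_sparse, labels, P)

# Dense reference, for verification only.
Y_dense = np.zeros((n, c))
Y_dense[np.arange(n), labels] = 1.0
print(np.allclose(YQ_sparse, Y_dense @ Q), np.allclose(YtP_sparse, Y_dense.T @ P))
```

In the multilabel case the same idea applies with a general sparse matrix format, and the work scales with the total number of nonzero labels.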

2.3 Experiments 

Table 1: Data sets used for experimentation and times to compute an embedding. Timings are for a
Matlab implementation on a standard desktop (dual 3.2 GHz Xeon E5-1650 CPU and 48 GB of RAM).

Dataset  Type        Modality  Examples  Features  Classes  Rembrandt k  Rembrandt Time (sec)
ODP      Multiclass  Text      ~1.5M     ~0.5M     ~100K    300          6,530
LSHTC    Multilabel  Text      ~2.4M     ~1.6M     ~325K    500          8,006


Table 2: ODP results, k = 300 for all embedding strategies. RE = Rembrandt; CS = compressed
sensing; PCA = unsupervised (feature) embedding; LT = LomTree (Choromanska & Langford,
2014); “A + LR” = logistic regression on representation A.

Method      RE + LR  CS + LR  PCA + LR  LT
Test Error  83.15%   85.14%   90.37%    93.46%


Table 3: LSHTC results. FastXML and LPSR-NB are from Prabhu & Varma (2014). “A + ILR” =
independent logistic regression on representation A.

Method          RE (k = 800) + ILR  RE (k = 500) + ILR  FastXML  LPSR-NB
Precision-at-1  53.39%              52.84%              49.78%   27.91%